- Day 1 - Getting started
- Day 2 - Let's code
- Day 3 - Tidyverse
- Day 4 - Plotly
- Day 5 - Shiny Introduction
- Day 6 - Reactivity
- Day 7 - Modules
- Day 8 - Shiny Project
January, 2018
Packages are installed with install.packages("package name"), e.g. install.packages("tidyverse"). Search the help files with ??function_name, e.g. ??tidyverse.
Example 1 - Hello World
myFunction <- function() {
  print("Hello World")
}
myFunction()
## [1] "Hello World"
Example 2 - with inputs
myFunction <- function(a, b = 2) {
  total <- a + b
  return(total)
}
myFunction(1, 1)
## [1] 2
myFunction(1)
## [1] 3
Example 3 - using titanic data and glm function to fit a logistic regression
install.packages("titanic")
library(titanic)
fit <- glm(
  data = titanic_train,
  formula = Survived ~ Sex + Age + Pclass,
  family = "binomial"
)
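Once the model is fitted, it is worth inspecting it before using it. A quick sketch of the usual checks (assumes the `fit` object from above; the `new_passengers` data frame is a made-up example):

```r
# Inspect the fitted logistic regression (assumes `fit` from the glm call above)
summary(fit)      # coefficient estimates, standard errors, p-values
exp(coef(fit))    # coefficients as odds ratios

# Predicted survival probabilities for hypothetical passengers:
new_passengers <- data.frame(
  Sex    = c("female", "male"),
  Age    = c(30, 30),
  Pclass = c(1, 3)
)
predict(fit, newdata = new_passengers, type = "response")
```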
Example 4 - use 'rio' package to read and write data from files
install.packages("rio")
data <- rio::import(file = "Data/titanic_train.csv", setclass = "tbl", integer64 = "double")
rio::export(x = titanic_train, file = "Data/titanic_train.csv")
When working with big data, use Spark. Spark is much faster than plain R and can handle very large datasets. Note that not all R functions work in Spark.
install.packages("sparklyr")
library(sparklyr)
spark_home_set("Spark/spark-2.2.1-bin-hadoop2.7")
sc <- spark_connect(master = "local") # Create a connection to Spark
data <- spark_read_csv(
  sc,
  "titanic",
  "Data/titanic_train.csv",
  memory = FALSE,
  overwrite = TRUE
)
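Once the data is registered with Spark, most dplyr verbs are translated to Spark SQL and run inside Spark; collect() brings the (usually much smaller) result back into R. A sketch, assuming the `sc` connection and `data` table from above:

```r
# dplyr verbs run inside Spark; collect() returns an ordinary R tibble
# (a sketch - assumes the `sc` connection and `data` Spark table from above)
library(dplyr)

survival_by_class <- data %>%
  group_by(Pclass) %>%
  summarise(n = n(), survived = sum(Survived)) %>%
  collect()   # execute in Spark, pull the result into R

spark_disconnect(sc)  # close the connection when done
```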
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.3.4     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
select(diamonds, cut, color, carat, price)  # by name
select(diamonds, x:z)                       # a range of adjacent columns
select(diamonds, -(x:z))                    # drop a range of columns
select(diamonds, starts_with("c"))
select(diamonds, ends_with("e"))
select(diamonds, contains("r"))
TIP: Move sorting variables to the start of the data frame and only keep the important variables. Variables can be renamed at the same time.
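The tip above can be done in a single select() call: named arguments rename on the fly, and everything() keeps the remaining columns after the ones you list first (drop it to keep only the listed variables). A sketch using the diamonds data:

```r
# Move key variables to the front, renaming as we go;
# everything() retains the rest of the columns in their original order
library(ggplot2)  # for the diamonds dataset
library(dplyr)

select(diamonds, Price = price, Carat = carat, cut, everything())
```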
filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
filter(diamonds, cut=="Ideal")
filter(diamonds, cut!="Ideal")
filter(diamonds, carat>=4) # <, >, ==, !=, <=, >=
filter(diamonds, cut=="Ideal" & carat>=4 )
filter(diamonds, cut=="Ideal" | carat>=4 )
filter(diamonds, cut %in% c("Ideal","Premium"))
sqrt(2)^2 == 2      # FALSE - floating-point numbers are not stored exactly
near(sqrt(2)^2, 2)  # TRUE - near() compares with a built-in tolerance
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
arrange(diamonds, cut)               # A-Z
arrange(diamonds, desc(cut))         # Z-A
arrange(diamonds, price)             # Small to large
arrange(diamonds, desc(price))       # Large to small
arrange(diamonds, cut, desc(price))  # By two or more variables
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
TIP: Arithmetic operators are useful in conjunction with aggregate functions, e.g. X/sum(X) gives the proportion, and Y-mean(Y) computes the difference from the mean.
TIP: Offsets allow you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), but make sure to sort first using arrange().
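The offsets tip can be sketched on a small made-up data frame (`df` and its columns are illustrative): sort, group, then difference against lag().

```r
# Running differences within groups: arrange first, then lag()
library(dplyr)

df <- tibble(
  group = c("a", "a", "a", "b", "b"),
  x     = c(1, 3, 6, 10, 15)
)

running <- df %>%
  arrange(group, x) %>%
  group_by(group) %>%
  mutate(diff = x - lag(x))  # NA for the first row of each group
running
```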
mutate(
  diamonds,
  price_p_carat = price / carat,
  diff = price_p_carat - mean(price_p_carat),
  z_score = diff / sd(price_p_carat)
)
The last key verb is summarise(). It collapses a data frame to a single row. summarise() is not terribly useful unless we pair it with group_by().
summarise(
  diamonds,
  N = n(),
  sum = sum(price),
  ave1 = sum / N,
  SSD = sum((price - mean(price))^2),
  SD = sqrt(SSD / (n() - 1))
)
## # A tibble: 1 x 5
##       N       sum   ave1          SSD      SD
##   <int>     <int>  <dbl>        <dbl>   <dbl>
## 1 53940 212135217 3932.8 858473135517 3989.44
When you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group".
TIP: group_by() is useful when calculating statistics per group. These statistics can then be easily compared.
TIP: Complicated models can also be built and then run on a group-by-group basis.
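One way to apply this tip (a base-R sketch, not necessarily how the course does it): split the data by group, fit a model per piece, then compare a coefficient across groups.

```r
# Fit one linear model per cut level and compare the price-per-carat slopes
library(ggplot2)  # for the diamonds dataset

fits <- lapply(
  split(diamonds, diamonds$cut),
  function(d) lm(price ~ carat, data = d)
)
sapply(fits, function(m) coef(m)[["carat"]])  # slope per cut
```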
WARNING: When using group_by() with summarise(), each summarise() peels off the last level of grouping. That means if you group by Var1 and Var2, after doing a summary the data frame will only be grouped by Var1. Thus the order of the variables in group_by() matters.
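The peeling behaviour can be checked with dplyr's group_vars(), which lists the current grouping variables:

```r
# summarise() drops the last grouping level
library(dplyr)
library(ggplot2)  # for the diamonds dataset

g <- group_by(diamonds, color, clarity)
group_vars(g)   # "color" "clarity"

g2 <- summarise(g, n = n())
group_vars(g2)  # "color" - clarity was peeled off by summarise()
```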
diamonds_grouped <- group_by(diamonds, cut)
summarise(
  diamonds_grouped,
  N = n(),
  average = mean(price),
  SD = sd(price)
)
## # A tibble: 5 x 4
##         cut     N  average       SD
##       <ord> <int>    <dbl>    <dbl>
## 1      Fair  1610 4358.758 3560.387
## 2      Good  4906 3928.864 3681.590
## 3 Very Good 12082 3981.760 3935.862
## 4   Premium 13791 4584.258 4349.205
## 5     Ideal 21551 3457.542 3808.401
%>% (the pipe) is used to string functions together. This makes a set of steps clear and condensed.
diamonds %>%
  group_by(color, clarity) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  plot_ly(
    x = ~color,
    y = ~prop,
    color = ~clarity,
    type = "bar",
    colors = pal_deloitte
  ) %>%
  layout(barmode = "stack")